CONVERGENCE OF SIMULATION-BASED POLICY ITERATION
Similar Resources
Convergence of Simulation-Based Policy Iteration
Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide...
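The idea in the abstract above, replacing the exact solution of the average-reward evaluation equations with a simulation estimate, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-state toy MDP, the regenerative-cycle estimator anchored at a reference state `REF`, the use of known transition probabilities in the improvement step, and all function names.

```python
import random

# Hypothetical 2-state, 2-action recurrent MDP (illustrative, not from the paper).
# P[s][a] is the next-state distribution; R[s][a] is the one-step reward.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.7, 0.3], 1: [0.1, 0.9]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.0, 1: 2.0}}
STATES, ACTIONS, REF = [0, 1], [0, 1], 0


def step(rng, s, a):
    """Sample the next state from P[s][a]."""
    return rng.choices(STATES, weights=P[s][a])[0]


def simulate_evaluation(policy, rng, cycles=2000):
    """Estimate the gain g and bias h of `policy` by simulation (assumed estimator)."""
    # Gain: total reward divided by total time over cycles that start and end at REF.
    total_r, total_t = 0.0, 0
    for _ in range(cycles):
        s = REF
        while True:
            total_r += R[s][policy[s]]
            total_t += 1
            s = step(rng, s, policy[s])
            if s == REF:
                break
    g = total_r / total_t
    # Bias: average of sum(r - g) collected before first reaching REF; h[REF] = 0.
    h = {REF: 0.0}
    for s0 in STATES:
        if s0 == REF:
            continue
        acc = 0.0
        for _ in range(cycles):
            s = s0
            while s != REF:
                acc += R[s][policy[s]] - g
                s = step(rng, s, policy[s])
        h[s0] = acc / cycles
    return g, h


def improve(h):
    """Greedy improvement using the estimated bias and the known model."""
    def score(s, a):
        return R[s][a] + sum(p * h[t] for t, p in zip(STATES, P[s][a]))
    return {s: max(ACTIONS, key=lambda a: score(s, a)) for s in STATES}


def sbpi(iterations=4, seed=0):
    """Alternate simulated evaluation and greedy improvement."""
    rng = random.Random(seed)
    policy = {s: 0 for s in STATES}
    for _ in range(iterations):
        _, h = simulate_evaluation(policy, rng)
        policy = improve(h)
    g, _ = simulate_evaluation(policy, rng)  # gain estimate for the final policy
    return policy, g
```

Calling `sbpi()` returns the final policy as a dictionary from state to action, together with a simulated estimate of its long-run average reward. Because the evaluation step is noisy, the number of simulated cycles controls how reliably each improvement step picks the better action.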
Convergence Properties of Policy Iteration
This paper analyzes asymptotic convergence properties of policy iteration in a class of stationary, infinite-horizon Markovian decision problems that arise in optimal growth theory. These problems have continuous state and control variables and must therefore be discretized in order to compute an approximate solution. The discretization may render inapplicable known convergence results for poli...
Convergence Analysis of Policy Iteration
Adaptive optimal control of nonlinear dynamic systems with deterministic and known dynamics under a known undiscounted infinite-horizon cost function is investigated. A policy iteration scheme initiated with a stabilizing initial control is analyzed for solving the problem. The convergence of the iterations and the optimality of the limit functions, which follows from the established uniqueness o...
Multilevel simulation based policy iteration for optimal stopping – convergence and complexity
This paper presents a novel approach to reduce the complexity of simulation based policy iteration methods for solving optimal stopping problems. Typically, Monte Carlo construction of an improved policy gives rise to a nested simulation algorithm. In this respect our new approach uses the multilevel idea in the context of the nested simulations, where each level corresponds to a specific numbe...
On the Convergence of Optimistic Policy Iteration
We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q-values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estim...
Journal
Journal title: Probability in the Engineering and Informational Sciences
Year: 2003
ISSN: 0269-9648,1469-8951
DOI: 10.1017/s0269964803172051